perm filename EVALU[DIS,DBL]2 blob sn#209294 filedate 1976-04-07 generic text, type C, neo UTF8
COMMENT ⊗   VALID 00015 PAGES
C REC  PAGE   DESCRIPTION
C00001 00001
C00003 00002	.NSECP(Evaluating AM)
C00006 00003	.SSEC(Judging Performance)
C00012 00004	. SSSEC(AM's Ultimate Discoveries)
C00014 00005	. SSSEC(The Magnitude of AM's Progress)
C00015 00006	. SSSEC(The Quality of AM's Route)
C00016 00007	. SSSEC(The Character of the User-System Interactions)
C00017 00008	. SSSEC(AM's Intuitive Powers)
C00018 00009	. SSSEC(Experiments on AM)
C00019 00010	. SSSEC(How to Perform Experiments on AM)
C00020 00011	. SSSEC(Future Implications of this Project)
C00021 00012	.SSEC(Experiments on AM)
C00029 00013	.SSEC(Capabilities and Limitations of AM)
C00030 00014	.SSEC(The Role of the Human)
C00031 00015	.SSEC(Summary of Conclusions)
C00032 ENDMK
C⊗;
.NSECP(Evaluating AM)

This chapter contains discussions "meta" to AM itself.

< The order has been changed, to protect the innocent >

The first section summarizes what precisely AM itself managed to do, what it tried
to do but failed, and what it "should" have tried but never even noticed.
This is a compact summary of all of AM's results -- and its glaring$$
The judgment that AM should have noticed X is based on hindsight. Namely, X
is now a well-known part of current mathematics. For example, real numbers. $
omissions.

Section {SECNUM}.2 deals with the capabilities and limitations of AM.

.B48

   What are some notable omissions in AM's behavior? Can the user elicit these?

   What concepts can be elicited from AM now? With a little tuning or tiny additions?

   What could probably be done within a couple months of modifications?

   Aside from a total change of domain, what kinds of activities does AM lack
   (e.g., proof capabilities)? What concepts and discoveries are beyond its design
   limitations?

.E

Next comes an essay about judging the performance of a system like AM.
This is a very hard task, since AM has no "goal". Even using current mathematical
standards, should AM be judged on what it produced, or the quality of the
path which led to those results, or the difference between what it started with
and what it finally derived?


Next is an evaluation of the human engineering features (and humans' reactions).
What is the role of the user, both in actuality and ultimately?

Finally, all the conclusions will be gathered together. The next chapter will
apply those conclusions to the general problem of emulating empirical research.

.SSEC(Judging Performance)


One may view AM's activity  as a progression from an initial core of knowledge
to a more sophisticated "final"$$ As has been stressed over and over, AM has no
fixed goal, no "final" state. For practical purposes, however, the totality of
explorations by AM is about the same as the "best run so far"; either of these can be
thought of as defining what is meant by the "final" state of knowledge. $
body of concepts and their facets.
Then each of the following is a reasonable way to measure success, to "judge" AM:


.BN

λλ By AM's ultimate achievements. Examine the list of 
concepts and methods AM developed.
Did AM ever discover anything interesting yet unknown to the user?$$ 
The "user" is a human who works with AM interactively, giving it hints, commands,
questions, etc.
Notice that by "new" we mean new to the user, not new to Mankind. 
This might occur if the user were a child, and AM discovered
some elementary facts of arithmetic.
This is not really
so provincial:  mathematicians take "new" to mean new to Mankind, not
new in the Universe.  I feel philosophy slipping in, so this footnote is
terminated. $ Anything new to Mankind?

λλ By the character of the difference between the initial and final states.
Progressing from set theory to number theory is much more impressive than progressing
from two-dimensional geometry to three-dimensional geometry.

λλ By the quality of the route AM took to accomplish these advances:  
How clever, how circuitous,
how many of the detours were quickly identified as such and abandoned?
 
λλ By the character of the User--System interactions: How important is the user's
guidance? How closely must he guide AM? What happens if he doesn't say anything ever?
When he does want to say something, is there an easy way to express that to AM,
and does AM respond well to it?
Given a reasonable kick in the right direction, can AM develop the mini-theories
which the user intended, or at least something equally interesting?

λλ By its intuitive heuristic powers: Does AM believe in "reasonable" conjectures?
How accurately does AM estimate the difficulty of tasks it
is considering?  
Does AM tie together (e.g., as analogous) concepts which are formally unrelated
yet which benefit from such a tie?

λλ By the results of the experiments described in
Section {SECNUM}.{[2] EXPTSSEC}, page {[3] EXPTPAGE}.
How fragile is the worth numbering scheme? The priority of tasks scheme?
How domain-specific are those heuristics really? The set of facets?

λλ By the fact that the kinds of experiments outlined in the next section can
easily be "set up" and performed on AM.
Regardless of the experiments' outcomes, 
the features of AM which allow them to be carried
out at all are worthy of note.

λλ By the implications of this project. What can AM suggest about educating
young mathematicians (and scientists in general)?
What can AM say about doing math (about empirical research in general)?

.E

A subsection is now provided for each of these measuring criteria.
Each one illustrates (i) a stunning achievement and (ii) a stunning failure
of AM along that dimension, and (iii) tries to characterize AM's performance
objectively according to that measure.

. SSSEC(AM's Ultimate Discoveries)

λλ By AM's ultimate achievements. Examine the list of 
concepts and methods AM developed.
Did AM ever discover anything interesting yet unknown to the user?$$ 
The "user" is a human who works with AM interactively, giving it hints, commands,
questions, etc.
Notice that by "new" we mean new to the user, not new to Mankind. 
This might occur if the user were a child, and AM discovered
some elementary facts of arithmetic.
This is not really
so provincial:  mathematicians take "new" to mean new to Mankind, not
new in the Universe.  I feel philosophy slipping in, so this footnote is
terminated. $ Anything new to Mankind?


. SSSEC(The Magnitude of AM's Progress)

λλ By the character of the difference between the initial and final states.
Progressing from set theory to number theory is much more impressive than progressing
from two-dimensional geometry to three-dimensional geometry.

. SSSEC(The Quality of AM's Route)

λλ By the quality of the route AM took to accomplish these advances:  
How clever, how circuitous,
how many of the detours were quickly identified as such and abandoned?
 

. SSSEC(The Character of the User-System Interactions)

λλ By the character of the User--System interactions: How important is the user's
guidance? How closely must he guide AM? What happens if he doesn't say anything ever?
When he does want to say something, is there an easy way to express that to AM,
and does AM respond well to it?
Given a reasonable kick in the right direction, can AM develop the mini-theories
which the user intended, or at least something equally interesting?

. SSSEC(AM's Intuitive Powers)

λλ By its intuitive heuristic powers: Does AM believe in "reasonable" conjectures?
How accurately does AM estimate the difficulty of tasks it
is considering?  
Does AM tie together (e.g., as analogous) concepts which are formally unrelated
yet which benefit from such a tie?

. SSSEC(Experiments on AM)

λλ By the results of the experiments described in
Section {SECNUM}.{[2] EXPTSSEC}, page {[3] EXPTPAGE}.
How fragile is the worth numbering scheme? The priority of tasks scheme?
How domain-specific are those heuristics really? The set of facets?

. SSSEC(How to Perform Experiments on AM)

λλ By the fact that the kinds of experiments outlined in the next section can
easily be "set up" and performed on AM.
Regardless of the experiments' outcomes, 
the features of AM which allow them to be carried
out at all are worthy of note.

. SSSEC(Future Implications of this Project)

λλ By the implications of this project. What can AM suggest about educating
young mathematicians (and scientists in general)?
What can AM say about doing math (about empirical research in general)?

.SSEC(Experiments on AM)

.EXPTSSEC: SSECNUM;

.EXPTPAGE: PAGE;

The following points are covered for each experiment:

.BN

λλ How it was thought of. Why did it come to mind?

λλ What  will  be gained  by it.  The implications  of some  possible
outcomes.

λλ How the experiment was set up. What preparations/modifications
had to be made. How much time (man-hours) it took.

λλ Description of what happened.

λλ How did this differ from normal? From what was expected?

λλ Conclusions. What have we really learned from this experiment?
Does it suggest any new experiments? Does it imply anything about how an
AM-like system would benefit from a better machine? A different
domain? Anything about math or the teaching of math?

.E

In all, six experiments were performed on AM.


.B

1) Set the interestingness factor of all concepts to 200 initially.
   Result: Occasional wanderings, but still bursts of creative, driven activity.
	   Cardinality was reached in about 3 times as many cycles.
   Conclusion: The interestingness factors of the concepts are useful for deciding
	what to do in close situations, or where few good reasons exist,
	but even 1 good reason is far more influential -- and rightly so!

2) Pick a random candidate to do next, but maintain INTHRESH as it is
	(so the average job-list length is about 20). Also, leave the
	interestingness factors of the concepts as they are normally (0-1000).
   Result: On the average, it takes about 20 times as long to get to
	a given job. However, several "good" jobs are sprinkled
	around in the queue, so performance (timewise) is cut only by a small
	factor. Behavior, though, is much less focused and rational.
	Typically, a "good" candidate will be chosen, having reasons all of which
	were true 10 cycles ago -- and which are clearly superior to those of
	the last 10 candidates! This is what is so annoying to human onlookers.
   Result: Since AM was frequently working on a low-value task, it was unwilling
	to spend much time or space on it. So the mean time allotted per task
	fell to about 15 seconds (from the typical 30 seconds). Thus the "losers"
	were dealt with quickly, which softened the detriment to performance.
	In fact, many of these (the meaningless ones) "failed" almost instantly.
   Conclusion: Picking (on the average) the 20th-best candidate impedes progress
	by a factor less than 20 (about 7), but it dramatically degrades the
	"sensibleness" of AM's behavior, the continuity of its actions.
	Humans place a big value on absolute sensibleness, and believe that
	doing something silly 50% of the time is MUCH worse than being half as
	productive as always doing the next most logical task.
   Conclusion: Having 20 multi-processors simultaneously execute the top 20
	jobs would result in a gain of about 7 in the rate of "big" discoveries.
	That is, neither a full factor of 20 nor no gain at all.

3) Pick a random candidate to do next, and adjust INTHRESH so that no
	candidate is ever excluded from the job-list, and set all interestingness
	factors to 200.
   Result: Many "explosive" tasks were chosen, and the number of new concepts
	increased rapidly. As expected, most of these were real "losers".
	There seemed no rationality to AM's sequence of actions, and it was quite
	boring to watch it floundering so. The typical length of the agenda was
	about 500, and AM's performance was "slowed" by at least a couple orders
	of magnitude. A more subjective measure of its "intelligence" would say
	that it totally collapsed under this random scheme.
   Conclusion: Having 500 processors simultaneously execute all the jobs on 
	the agenda would increase AM's performance only by a factor of 10 or so.
	The truly "intelligent" behavior is AM's plausible sequencing of tasks.

4) Modify the global formula assigning a priority value to each job. Let it still
	be a function of the reasons for the job, but trivialize it:
	let the priority of a job be defined as simply the number of reasons it has
	(normalized by multiplying by 100, and cut off if over 1000).
	This raises the new question of what to do if several jobs all have the
	same priority. I suppose the answer is to execute them in stack-order
	(most recent first), since this is what AM will do anyway.

5) Eliminate "Equality", and see what AM does.
	The reason for doing this is that AM discovered Cardinality via the
	technique of generalizing the relation "Equality"-of-2-sets. What will
	happen if we eliminate this path? Will AM rederive Equality? Will it get
	to Cardinality via another route? Will it do some set-theoretic things?

6) General classes of experiments: modify/add/eliminate certain concepts;
    	modify certain heuristics;
	modify the strategy for choosing the next job, or the value assigned to jobs.

.E
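
The trivialized priority formula of experiment 4 is simple enough to sketch
in a few lines. The Python fragment below is only an illustration under stated
assumptions, not AM's actual code: the names Job, trivial_priority, and
next_job are hypothetical, and the agenda is assumed to be a plain list kept in
insertion order, so that "most recent first" means the job with the highest
list index wins a tie.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """A hypothetical agenda entry: a task name plus the reasons supporting it."""
    name: str
    reasons: List[str] = field(default_factory=list)


def trivial_priority(job: Job) -> int:
    """Experiment 4 formula: number of reasons, times 100, cut off at 1000."""
    return min(100 * len(job.reasons), 1000)


def next_job(agenda: List[Job]) -> Job:
    """Remove and return the highest-priority job.

    Assumption: the agenda is in insertion order (oldest first), so among
    jobs of equal priority the larger index is the more recent one. This
    gives the stack-order tie-break (most recent first).
    """
    if not agenda:
        raise ValueError("agenda is empty")
    best = max(range(len(agenda)),
               key=lambda i: (trivial_priority(agenda[i]), i))
    return agenda.pop(best)


if __name__ == "__main__":
    agenda = [
        Job("A", ["r1", "r2"]),   # priority 200, oldest
        Job("B", ["r1"]),         # priority 100
        Job("C", ["r1", "r2"]),   # priority 200, most recent
    ]
    while agenda:
        job = next_job(agenda)
        print(job.name, trivial_priority(job))
```

With this agenda the jobs come out in the order C, A, B: C and A tie at 200,
and the more recent C is executed first. Note how coarse the measure is. Any
two jobs with the same number of reasons are indistinguishable, however strong
or weak those reasons are, which is why the tie-breaking question arises at all.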

.SSEC(Capabilities and Limitations of AM)

.SSEC(The Role of the Human)

.SSEC(Summary of Conclusions)